All dependencies are listed in `environment.yml`.
Run `conda env create --file environment.yml` to set up the environment.
# For data processing & EDA
import numpy as np
import pandas as pd
import itertools
# Data viz related.
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from wordcloud import WordCloud
import plotly.graph_objects as go
import plotly.express as px
sns.set_theme(style="whitegrid")
plt.rcParams['figure.dpi'] = 150
import warnings
warnings.simplefilter('ignore')
This part includes all the preparatory work before answering the homework questions.
If you are not a U.S. citizen, "TSA" may well be an unfamiliar term (it was to me). Background research is the key to interpreting the data in the first place, so I will try to answer the questions below in this paragraph:
From Wikipedia, you can find the information below:
The Transportation Security Administration (TSA) is an agency of the U.S. Department of Homeland Security that has authority over the security of the traveling public in the United States.
The TSA has multiple screening processes and regulations related to passengers and carry-on luggage, including: identification requirements, pat-downs, full-body scanners, device restrictions, and explosives screening.
In short: the TSA is the organization responsible for screening passengers and baggage at U.S. airports for security reasons.
You can find the descriptions below on this page.
If you have experienced a loss or damage to your property and you feel that this loss or damage occurred as a direct result of negligence by a TSA employee, you may file a claim with TSA. If you feel the loss or damage was due to the negligence of your air carrier, please file a claim directly with the air carrier. If filing with TSA, you must include proof of your loss or damage as well as evidence of TSA negligence.
Put simply: as a passenger, if you experience loss or damage caused by TSA screening, you can file a claim with TSA for compensation.
The data is published at this official site.
I have listed some rows of the claim data below. Each row describes one claim: when and where it happened, what happened, and whether the claim was approved.
pd.read_excel("data/TSA_Claims-2002-2006.xls").head(3)
| | Claim Number | Date Received | Incident Date | Airport Code | Airport Name | Airline Name | Claim Type | Claim Site | Item | Claim Amount | Status | Close Amount | Disposition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0909802M | 2002-01-04 00:00:00 | 2002-12-12 00:00:00 | EWR | Newark International Airport | Continental Airlines | Property Damage | Checkpoint | Other | 350.00 | Approved | 350.00 | Approve in Full |
| 1 | 0202417M | 2002-02-02 00:00:00 | 2004-01-16 00:00:00 | SEA | Seattle-Tacoma International | NaN | Property Damage | Checked Baggage | Luggage (all types including footlockers) | 100.00 | Settled | 50.00 | Settle |
| 2 | 0202445M | 2002-02-04 00:00:00 | 2003-11-26 00:00:00 | STL | Lambert St. Louis International | American Airlines | Property Damage | Checked Baggage | Cell Phones | 278.88 | Settled | 227.92 | Settle |
We have three claim files. Before unioning all the files, let's check whether there are any header changes.
What we noticed:
# Check the columns in each file.
# Output is hidden for simplicity.
claim_list = ['TSA_Claims-2002-2006.xls', 'TSA_Claims-2007-2009.xls', 'TSA_Claims-2010-2013.xls']
for file in claim_list:
    print(file)
    # .info() prints its summary itself (and returns None), so don't wrap it in print().
    pd.read_excel(f"data/{file}").info(memory_usage=False)
    print("__________________________")
TSA_Claims-2002-2006.xls
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97231 entries, 0 to 97230
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Claim Number   97231 non-null  object
 1   Date Received  96973 non-null  object
 2   Incident Date  95150 non-null  object
 3   Airport Code   89207 non-null  object
 4   Airport Name   89207 non-null  object
 5   Airline Name   66496 non-null  object
 6   Claim Type     89593 non-null  object
 7   Claim Site     96688 non-null  object
 8   Item           95558 non-null  object
 9   Claim Amount   93945 non-null  float64
 10  Status         97231 non-null  object
 11  Close Amount   89373 non-null  float64
 12  Disposition    87097 non-null  object
dtypes: float64(2), object(11)
__________________________
TSA_Claims-2007-2009.xls
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47912 entries, 0 to 47911
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Claim Number   47912 non-null  int64
 1   Date Received  47912 non-null  datetime64[ns]
 2   Incident Date  47815 non-null  object
 3   Airport Code   47417 non-null  object
 4   Airport Name   47417 non-null  object
 5   Airline Name   44278 non-null  object
 6   Claim Type     47642 non-null  object
 7   Claim Site     47720 non-null  object
 8   Item           45624 non-null  object
 9   Claim Amount   47160 non-null  float64
 10  Status         47912 non-null  object
 11  Close Amount   45942 non-null  float64
 12  Disposition    44262 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(1), object(9)
__________________________
TSA_Claims-2010-2013.xls
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41598 entries, 0 to 41597
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Claim Number   41598 non-null  object
 1   Date Received  41597 non-null  datetime64[ns]
 2   Incident Date  41597 non-null  datetime64[ns]
 3   Airport Code   41597 non-null  object
 4   Airport Name   41597 non-null  object
 5   Airline Name   41597 non-null  object
 6   Claim Type     41597 non-null  object
 7   Claim Site     41597 non-null  object
 8   Item Category  41597 non-null  object
 9   Close Amount   41597 non-null  object
 10  Disposition    41597 non-null  object
dtypes: datetime64[ns](2), object(9)
__________________________
What we found:
# List the categorical columns we want to check!
cate_cols_to_check = ['Claim Type', 'Claim Site', 'Disposition']
# Add a helper function.
def cate_helper(col1, col2, df1_name, df2_name):
    name1 = df1_name.split('.')[0]
    name2 = df2_name.split('.')[0]
    A = set(col1)
    B = set(col2)
    print('Intersection:', len(A & B))
    print(list(A & B)[:7])
    print(f"'{name1}' - '{name2}':", len(A - B))
    print(list(A - B)[:7])
    print(f"'{name2}' - '{name1}':", len(B - A))
    print(list(B - A)[:7])

# There is no significant gap between 2002-2006 and 2007-2009, so we skip comparing those two.
df1_name = claim_list[1]
df2_name = claim_list[2]
df1 = pd.read_excel(f"data/{df1_name}")
df2 = pd.read_excel(f"data/{df2_name}")
for c in cate_cols_to_check:
    print('Check the column:', c)
    cate_helper(df1[c], df2[c], df1_name, df2_name)
    print('-------------------')
Check the column: Claim Type
Intersection: 6
[nan, 'Employee Loss (MPCECA)', 'Passenger Property Loss', 'Motor Vehicle', 'Personal Injury', 'Property Damage']
'TSA_Claims-2007-2009' - 'TSA_Claims-2010-2013': 1
['Passenger Theft']
'TSA_Claims-2010-2013' - 'TSA_Claims-2007-2009': 4
['Wrongful Death', 'Complaint', '-', 'Bus Terminal']
-------------------
Check the column: Claim Site
Intersection: 6
[nan, 'Checkpoint', 'Motor Vehicle', 'Checked Baggage', 'Bus Station', 'Other']
'TSA_Claims-2007-2009' - 'TSA_Claims-2010-2013': 0
[]
'TSA_Claims-2010-2013' - 'TSA_Claims-2007-2009': 1
['-']
-------------------
Check the column: Disposition
Intersection: 4
[nan, 'Approve in Full', 'Settle', 'Deny']
'TSA_Claims-2007-2009' - 'TSA_Claims-2010-2013': 0
[]
'TSA_Claims-2010-2013' - 'TSA_Claims-2007-2009': 1
['-']
-------------------
# Here we union all three tables. (DataFrame.append was removed in pandas 2.0, so use pd.concat.)
df = pd.concat([pd.read_excel(f"data/{i}") for i in claim_list], ignore_index=True)
# Data cleaning based on what we found.
df['Date Received'] = pd.to_datetime(df['Date Received'], errors = 'coerce')
df['Incident Date'] = pd.to_datetime(df['Incident Date'], errors = 'coerce')
df = df.dropna(subset = ['Date Received', 'Incident Date'])
df = df.replace('-', np.nan)
# Rephrase some categorical columns for better data viz.
df['Status'] = df['Status'].replace("In litigation", "Litigation").replace("In review", "Reviewing")
df['Status'] = df['Status'].str.split(' ').apply(lambda x :x[0] if isinstance(x, list) else x)
# As we noticed, "Status" is missing after 2010; let's fill it back in from "Disposition".
# (Use df.loc[mask, col] = ... to avoid chained-assignment problems.)
mask = df.Status.isna() & df.Disposition.notna()
df.loc[mask, "Status"] = df.loc[mask, "Disposition"].replace(
    {'Approve in Full': 'Approved', 'Settle': 'Settled', 'Deny': 'Denied', '-': 'Other'})
# These three categories have very few cases.
df['Status'] = df['Status'].replace(('Pending', 'Claim', 'Reviewing'), "Other")
df['Status'] = df['Status'].fillna("Other")
df['Claim Type'] = df['Claim Type'].replace('Employee Loss (MPCECA)', "Employee Loss")
df['Claim Type'] = df['Claim Type'].replace('Passenger Property Loss', "Property Loss")
df['Claim Type'] = df['Claim Type'].replace(('Complaint', 'Bus Terminal'), "Other")
df['Claim Type'] = df['Claim Type'].fillna("Other")
df['Claim Site'] = df['Claim Site'].replace('Bus Station', "Other")
df['Claim Site'] = df['Claim Site'].fillna("Other")
df.head(3)
| | Claim Number | Date Received | Incident Date | Airport Code | Airport Name | Airline Name | Claim Type | Claim Site | Item | Claim Amount | Status | Close Amount | Disposition | Item Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0909802M | 2002-01-04 | 2002-12-12 | EWR | Newark International Airport | Continental Airlines | Property Damage | Checkpoint | Other | 350.00 | Approved | 350.00 | Approve in Full | NaN |
| 1 | 0202417M | 2002-02-02 | 2004-01-16 | SEA | Seattle-Tacoma International | NaN | Property Damage | Checked Baggage | Luggage (all types including footlockers) | 100.00 | Settled | 50.00 | Settle | NaN |
| 2 | 0202445M | 2002-02-04 | 2003-11-26 | STL | Lambert St. Louis International | American Airlines | Property Damage | Checked Baggage | Cell Phones | 278.88 | Settled | 227.92 | Settle | NaN |
We can also join the airport-related tables. Joining tables with many columns is generally not good practice, but the size of this dataset is fine here.
# Join airport-related tables.
df = df.drop(columns=['Airport Name'])
air1 = pd.read_csv('data/Airport_Info.csv', index_col=0).drop(columns=['City'])
df = pd.merge(df, air1, on='Airport Code', how='left')
air2 = pd.read_csv('data/US_Airport_States.csv')
df = pd.merge(df, air2, on='Airport Code', how='left')
df = df.drop_duplicates(["Claim Number"])
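Since a duplicated key in an airport table would silently multiply claim rows, it can be worth validating joins like these up front. A minimal sketch with hypothetical miniature tables (the values are made up), using pandas' `validate` and `indicator` options:

```python
import pandas as pd

# Hypothetical miniature versions of the claims and airport tables.
claims = pd.DataFrame({"Claim Number": ["A1", "A2", "A3"],
                       "Airport Code": ["EWR", "SEA", "XXX"]})
airports = pd.DataFrame({"Airport Code": ["EWR", "SEA"],
                         "State": ["NJ", "WA"]})

# validate="m:1" raises MergeError if the right-hand key is not unique,
# so a duplicated airport row cannot silently multiply claims.
merged = pd.merge(claims, airports, on="Airport Code", how="left",
                  validate="m:1", indicator=True)

# indicator=True adds a "_merge" column flagging codes with no airport match.
print(merged[merged["_merge"] == "left_only"]["Airport Code"].tolist())  # ['XXX']
```

With `validate` in place, the `drop_duplicates` on "Claim Number" becomes a safety net rather than the only defense against join fan-out.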
Some observations:
# 'Property Damage' and 'Property Loss' are the top 2 claim types. Let's plot them separately from the other categories.
pc_plot = df[['Claim Type', 'Claim Site', 'Status']].loc[df['Claim Type'].isin(['Property Damage', 'Property Loss', 'Other'])].copy()
dic = {}
for i, v in enumerate(pc_plot["Claim Type"].unique()):
dic[v] = i + 1
pc_plot['color'] = pc_plot["Claim Type"].map(dic)
fig = px.parallel_categories(pc_plot, dimensions=['Claim Type', 'Claim Site', 'Status'], color='color', color_continuous_scale=px.colors.sequential.Inferno)
fig.show()
pc_plot = df[['Claim Type', 'Claim Site', 'Status']].loc[~df['Claim Type'].isin(['Property Damage', 'Property Loss', 'Other'])].copy()
dic = {}
for i, v in enumerate(pc_plot["Claim Type"].unique()):
dic[v] = i
pc_plot['color'] = pc_plot["Claim Type"].map(dic)
fig = px.parallel_categories(pc_plot, dimensions=['Claim Type', 'Claim Site', 'Status'], color='color', color_continuous_scale=px.colors.sequential.Inferno)
fig.show()
# In the 'Item' column, '; ' is used to separate multiple items in one claim.
string_list = list(itertools.chain.from_iterable(
    i for i in df.Item.str.replace(r"\(.*\)", "", regex=True).str.split('; ') if isinstance(i, list)))
# Here we take the first word to make the WordCloud graph more concise!
# E.g., 'Clothing - Shoes, belts, accessories, etc.' will be 'Clothing' only.
long_string = ','.join([i.split(' ')[0] for i in string_list])
wordcloud = WordCloud(width=1200, height=600, background_color="white", contour_width=3, contour_color='steelblue', collocations=False)
wordcloud.generate(long_string)
wordcloud.to_image()
url = (
"https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
f = folium.Figure(width=1000, height=500)
m = folium.Map(location=[48, -102], zoom_start=3).add_to(f)
folium.Choropleth(width=500, height=500,
geo_data=state_geo,
name="choropleth",
data=df.groupby("State")["Claim Number"].count().reset_index().rename(columns = {"Claim Number":"#Claims"}),
columns=["State", "#Claims"],
key_on="feature.id",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="#Claims between 2002~2013",
).add_to(m)
folium.LayerControl().add_to(m)
m
The output below reveals some outliers: neither negative values nor enormous values are possible here.
We should drop the outliers first.
df['date_diff'] = df['Date Received'] - df['Incident Date']
print(df['date_diff'].describe())
# Box plot with raw data. Need to drop outliers.
sns.set(rc = {'figure.figsize':(3,2)})
ax = sns.boxplot(df['date_diff'].dt.days)
ax.set(xlabel="Date difference (days)")
ax
count                          184247
mean      47 days 21:57:45.969052414
std      692 days 01:35:37.645647184
min             -1826 days +00:00:00
25%                 12 days 00:00:00
50%                 23 days 00:00:00
75%                 45 days 00:00:00
max              73134 days 00:00:00
Name: date_diff, dtype: object
def IQR_drop_outlier(df, col):
    # Return a boolean mask that keeps rows inside the 1.5*IQR range (i.e., identifies non-outliers).
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    return (df[col] > (Q1 - 1.5 * IQR)) & (df[col] < (Q3 + 1.5 * IQR))
# Drop outliers out of 1.5 IQR range or with negative timedelta.
df = df.loc[(df.date_diff > pd.Timedelta(0)) & IQR_drop_outlier(df, 'date_diff')]
print(df['date_diff'].describe())
count                          166182
mean      27 days 05:58:09.513304689
std       20 days 15:01:58.664456462
min                  0 days 00:45:00
25%                 12 days 00:00:00
50%                 21 days 00:00:00
75%                 38 days 00:00:00
max                 94 days 11:30:00
Name: date_diff, dtype: object
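To see what the 1.5×IQR rule does, here is a minimal sketch on a hypothetical five-value series (the numbers are made up):

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4, 100]})  # 100 is the obvious outlier

Q1 = toy["x"].quantile(0.25)  # 2.0
Q3 = toy["x"].quantile(0.75)  # 4.0
IQR = Q3 - Q1                 # 2.0

# Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] = [-1.0, 7.0].
mask = (toy["x"] > Q1 - 1.5 * IQR) & (toy["x"] < Q3 + 1.5 * IQR)
print(toy.loc[mask, "x"].tolist())  # [1, 2, 3, 4]
```

Applied to `date_diff` above, the same rule trims both the negative deltas and the multi-decade ones.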
What we got from the visualization below:
# By year and by month results.
from numpy import mean
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
sns.set(rc = {'figure.figsize':(8,6)})
fig, axs = plt.subplots(nrows=3)
ax = sns.barplot(y = df['date_diff'].dt.days, x = df['Incident Date'].dt.year, ax=axs[0], estimator=mean)
ax = sns.barplot(y = df['date_diff'].dt.days, x = df['Incident Date'].dt.month, ax=axs[1], estimator=mean)
ax = sns.barplot(y = df['date_diff'].dt.days, x = df['Incident Date'].dt.weekday, ax=axs[2], estimator=mean)
axs[0].set(ylabel="Mean date_diff (days)", xlabel="Year of the incident")
axs[1].set(ylabel="Mean date_diff (days)", xlabel="Month of the incident")
axs[2].set(ylabel="Mean date_diff (days)", xlabel="Weekday of the incident")
fig.tight_layout()
fig.show()
fig, axs = plt.subplots(nrows=2)
ax = sns.boxplot(y = df['date_diff'].dt.days, x = df['Claim Type'], ax=axs[0])
ax = sns.boxplot(y = df['date_diff'].dt.days, x = df['Claim Site'], ax=axs[1])
fig.set_size_inches(9, 5)
fig.tight_layout()
fig.show()
We noticed some extreme values in the "Claim Amount" column. They mainly come from the "Personal Injury" claim type. Let's choose the statistic carefully!
# Again, we have some crazy outliers here.
print(df["Claim Amount"].describe())
sns.set(rc = {'figure.figsize':(4,2)})
ax = sns.boxplot(df['Claim Amount'])
ax = ax.set(xlabel="Claim Amount (Dollar)")
count    1.257920e+05
mean     2.385212e+07
std      8.458527e+09
min      0.000000e+00
25%      6.000000e+01
50%      1.699550e+02
75%      4.430000e+02
max      3.000000e+12
Name: Claim Amount, dtype: float64
# Top 5 largest "Claim Amount". All from "Personal Injury".
df.sort_values(by = "Claim Amount", ascending = False).dropna(subset = ['Claim Amount'])[list(df1.columns) + ['State']].head(5)
| | Claim Number | Date Received | Incident Date | Airport Code | Airport Name | Airline Name | Claim Type | Claim Site | Item | Claim Amount | Status | Close Amount | Disposition | State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 119206 | 2008012238289 | 2008-02-12 | 2007-12-28 | JFK | John F Kennedy International Airport | NaN | Personal Injury | Checkpoint | NaN | 3.000000e+12 | Denied | 0.0 | Deny | NY |
| 68947 | 2005080990257 | 2005-08-01 | 2005-06-13 | PHX | Phoenix Sky Harbor International Airport | Sun Country Airlines Inc | Personal Injury | Checkpoint | Medicines | 1.250000e+08 | Denied | 0.0 | Deny | AZ |
| 86588 | 2006060907675 | 2006-06-06 | 2006-04-19 | LAX | Los Angeles International Airport | America West | Personal Injury | Checked Baggage | Currency; Locks; Other | 1.000000e+08 | Denied | 0.0 | Deny | CA |
| 48260 | 2004122069372 | 2004-11-02 | 2004-09-23 | OKC | Will Rogers World Airport | American Airlines | Personal Injury | Checkpoint | NaN | 2.722500e+07 | Denied | 0.0 | Deny | OK |
| 12590 | 0818500M | 2003-08-18 | 2003-07-14 | JFK | John F Kennedy International Airport | American Airlines | Personal Injury | Checked Baggage | Other | 2.000000e+07 | Denied | 0.0 | Deny | NY |
Solution 1: Result without any processing. (Attention: The number could be misleading due to outliers!)
ans_1 = df.groupby(["State"])["Claim Amount"].mean().sort_values(ascending = False).reset_index()
ans_1[0:1]
| | State | Claim Amount |
|---|---|---|
| 0 | NY | 3.217553e+08 |
Solution 2: Drop the outliers out of 1.5 IQR.
ans_2 = df.loc[IQR_drop_outlier(df, 'Claim Amount')].groupby(["State"])["Claim Amount"].mean().sort_values(ascending = False).reset_index()
ans_2[0:1]
| | State | Claim Amount |
|---|---|---|
| 0 | PR | 310.780077 |
Solution 3: Use the median. Percentiles are usually more robust than the average. (The result aligns perfectly with the second solution.)
ans_3 = df.groupby(["State"])["Claim Amount"].median().sort_values(ascending = False).reset_index()
ans_3[0:1]
| | State | Claim Amount |
|---|---|---|
| 0 | PR | 375.0 |
Solution 4: Take the mean of the approved "Close Amount". (This could be the best statistic if your goal is the average "reasonable claim value.")
ans_4 = df.loc[df.Disposition.isin(['Approve in Full', 'Settle'])].groupby(["State"])["Close Amount"].mean().sort_values(ascending = False).reset_index()
ans_4[0:1]
| | State | Close Amount |
|---|---|---|
| 0 | PR | 275.087973 |
url = (
"https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
f = folium.Figure(width=1000, height=500)
m = folium.Map(location=[48, -102], zoom_start=3).add_to(f)
folium.Choropleth(width=500, height=500,
geo_data=state_geo,
name="choropleth",
data=ans_4,
columns=["State", "Close Amount"],
key_on="feature.id",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
    legend_name="Average approved Close Amount (USD)",
).add_to(m)
folium.LayerControl().add_to(m)
m
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses. -- Wikipedia "Outlier"
The average is a statistic that is easily distorted by outliers. In this case, we noticed a data point with a claim amount of three trillion USD (for scale, Japan's GDP in 2019 was about 5 trillion USD). The "Claim Amount" value is entered manually as an arbitrary number, so this enormous figure could be a "correct input," and the mean we computed (solution 1) is technically correct as well. But from an analysis point of view, it is a meaningless and misleading number.
Analyzing data in a top-down manner can help: clarify your question first, then choose the statistic with enough prior knowledge. (The picture below explains these two approaches well. Source)
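The mean-versus-median sensitivity is easy to reproduce on a hypothetical toy array with one extreme value (the numbers below are made up for illustration):

```python
import numpy as np

# Three ordinary claim amounts plus one 3-trillion-dollar entry.
amounts = np.array([100.0, 200.0, 300.0, 3e12])

print(np.mean(amounts))    # 750000000150.0 -- dominated by the single outlier
print(np.median(amounts))  # 250.0          -- barely affected
```

One absurd entry pulls the mean up by eleven orders of magnitude while the median stays in the range of the typical values.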
df.loc[df['City'] == 'New York'].groupby("Claim Type")["Claim Number"].count().sort_values(ascending = False)[:3]
Claim Type
Property Loss      7133
Property Damage    2894
Other               322
Name: Claim Number, dtype: int64
print("Top claimed city")
df.groupby(["City"])["Claim Number"].count().reset_index().sort_values( by = ['Claim Number'], ascending = False).head(4)
Top claimed city
| | City | Claim Number |
|---|---|---|
| 261 | New York | 10429 |
| 216 | Los Angeles | 7639 |
| 64 | Chicago | 6976 |
| 262 | Newark | 5525 |
print("Overall by claim type percentage")
df.groupby("Claim Type")["Claim Number"].count().sort_values(ascending = False)/df.shape[0]
Overall by claim type percentage
Claim Type
Property Loss      0.595865
Property Damage    0.358499
Other              0.035726
Personal Injury    0.003707
Employee Loss      0.002521
Passenger Theft    0.002467
Motor Vehicle      0.001216
Name: Claim Number, dtype: float64
print("By claim type percentage in New York")
df.loc[df['City'] == 'New York'].groupby("Claim Type")["Claim Number"].count().sort_values(ascending = False)/df.loc[df['City'] == 'New York'].shape[0]
By claim type percentage in New York
Claim Type
Property Loss      0.683958
Property Damage    0.277495
Other              0.030875
Personal Injury    0.003356
Passenger Theft    0.002110
Employee Loss      0.001342
Motor Vehicle      0.000863
Name: Claim Number, dtype: float64
# Let's reuse the code from paragraph 0.3.
pc_plot = df.loc[df['City'] == 'New York'][['Claim Type', 'Claim Site', 'Status']].loc[df['Claim Type'].isin(['Property Loss', 'Property Damage', 'Other'])].copy()
dic = {}
for i, v in enumerate(pc_plot["Claim Type"].unique()):
dic[v] = i
pc_plot['color'] = pc_plot["Claim Type"].map(dic)
fig = px.parallel_categories(pc_plot, dimensions=['Claim Type', 'Claim Site', 'Status'], color='color', color_continuous_scale=px.colors.sequential.Inferno)
fig.show()
# Some preparation for the question.
# Truncate the data so it covers 10 whole years.
q4_df = df.loc[(df['Incident Date'] < pd.Timestamp("2013-01-01")) & (df['Incident Date'] >= pd.Timestamp("2003-01-01"))]
q4_df = q4_df.loc[q4_df['Tz Database Category'].str.contains('Pacific', na=False, regex=True)].copy()
q4_df['Incident month'] = q4_df['Incident Date'].dt.month
q4_df['Received month'] = q4_df['Date Received'].dt.month
q4_ans = q4_df.groupby('Incident month')["Claim Number"].count().reset_index()
tmp = q4_df.loc[q4_df.Status.isin(['Approved', 'Settled'])].groupby('Incident month')["Claim Number"].count().reset_index()
q4_ans["Approved Number"] = tmp['Claim Number']
q4_ans["Rejected Number"] = q4_ans["Claim Number"] - q4_ans["Approved Number"]
q4_ans["Approved proportion"] = q4_ans["Approved Number"]/q4_ans["Claim Number"]
q4_ans.sort_values(by = "Claim Number", ascending = False)
| | Incident month | Claim Number | Approved Number | Rejected Number | Approved proportion |
|---|---|---|---|---|---|
| 0 | 1 | 295 | 135 | 160 | 0.457627 |
| 2 | 3 | 271 | 116 | 155 | 0.428044 |
| 3 | 4 | 271 | 109 | 162 | 0.402214 |
| 5 | 6 | 253 | 102 | 151 | 0.403162 |
| 1 | 2 | 249 | 118 | 131 | 0.473896 |
| 7 | 8 | 243 | 92 | 151 | 0.378601 |
| 4 | 5 | 242 | 100 | 142 | 0.413223 |
| 9 | 10 | 242 | 98 | 144 | 0.404959 |
| 6 | 7 | 236 | 95 | 141 | 0.402542 |
| 8 | 9 | 227 | 101 | 126 | 0.444934 |
| 10 | 11 | 219 | 92 | 127 | 0.420091 |
| 11 | 12 | 213 | 91 | 122 | 0.427230 |
To visualise trends, we used two different approaches:
Both plots show the peak period for claims in the Pacific timezone, which is between January and May.
ax = pd.concat([q4_ans[6:], q4_ans[:6]]).plot.bar(x='Incident month', y=["Rejected Number", "Approved Number"], stacked=True, figsize=(6, 2), title="By-month bar plot (sum of claims between 2003~2013)")
ax.legend(prop={'size': 6})
trend = pd.DataFrame()
trend["Incident Month-day"] = pd.date_range(start="2011-06-01", end="2012-5-31")
trend["Incident Month-day"] = trend["Incident Month-day"].dt.strftime('%m-%d')
tmp = q4_df.copy()
tmp["Incident Month-day"] = tmp["Incident Date"].dt.strftime('%m-%d')
tmp = tmp.groupby("Incident Month-day").count()["Claim Number"].reset_index()
trend = pd.merge(trend, tmp, on = "Incident Month-day", how = 'left')
trend = trend.fillna(0)
# Prepend the last 29 days so the 30-day window wraps around the year boundary.
trend = pd.concat([trend[-29:], trend])
trend["30day-sum"] = trend["Claim Number"].rolling(window=30).sum()
trend = trend.dropna(subset = ['30day-sum'])
sns.set(rc = {'figure.figsize':(12,3)})
ax = sns.lineplot(data = trend, x = trend.index, y = '30day-sum', sort = False)
ax.set_xticks([i for i in range(0, 365, 30)][:-1])
months = ['Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May']
ax.set_xticklabels(months)
ax.set_title('By day moving window trend plot(Sum of the claims between 2003~2013)')
ax
The peak period lines up with Hawaii's travel season. Peak tourism starts in mid-December; however, the incidents happen around the return trip, so the claims peak shifts to early January.
The peak tourism season in Hawaii typically starts in the middle of December and continues until the end of March or mid-April. The off-season stretches from the middle of April and continues until mid-June, and resumes again from September until crowds tick up before the holidays.
Data source: hawaii.gov
If we only look at the "rejection rate" estimated from historical data, we may notice that the airlines with the highest rates have very small sample sizes.
We can either:
- Set a minimum #claims threshold empirically, or use a sample size calculator.
- Decide it statistically.
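One way to "decide it statistically" (a sketch, not part of the original analysis) is to attach a confidence interval to each estimated rate. The Wilson score interval below is implemented by hand; even 7 denials out of 7 claims leaves a wide interval, so a 100% rate from a tiny sample is weak evidence:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion at ~95% confidence."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 7 denials out of 7 claims (counts as for XtraAirways in the table above).
low, high = wilson_interval(7, 7)
print(round(low, 3), round(high, 3))  # 0.646 1.0
```

The true rate could plausibly be anywhere from about 65% to 100%, which is why the analysis moves on to two-sample proportion tests on larger candidates.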
# Remove all whitespace to normalize airline names (so variants that differ only by spaces collapse to one key).
df["Airline Name"] = df["Airline Name"].str.replace(" ", "")
q5_ans = df.groupby("Airline Name")["Claim Number"].count().reset_index(name='count')
q5_ans = pd.merge(q5_ans, df.groupby("Airline Name")["Status"].apply(lambda x: (x=='Denied').sum()).reset_index(name='denied_count'), on = "Airline Name")
q5_ans["denied_rate"] = q5_ans.denied_count / q5_ans["count"]
q5_ans.sort_values(by = ["denied_rate", "denied_count"], ascending = False).head(5)
| | Airline Name | count | denied_count | denied_rate |
|---|---|---|---|---|
| 218 | XtraAirways | 7 | 7 | 1.0 |
| 69 | CameroonAirlines | 5 | 5 | 1.0 |
| 113 | HarmonyAirlines | 3 | 3 | 1.0 |
| 166 | RoyalAirMaroc | 3 | 3 | 1.0 |
| 5 | AeroGal | 2 | 2 | 1.0 |
# Scatter plot to visualize the rejection vs claims.
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter( x = q5_ans['count'], y = q5_ans['denied_rate'], text=q5_ans['Airline Name'],
mode='markers',
name='markers'))
fig.update_layout(title={
'text': "Scatter plot rejection rate v.s. total claims", 'x': 0.5}, xaxis_title="Total claims", yaxis_title='Rejection rate')
# Find the candidates for the test.
# These data points are the airlines with the highest #claims at each rejection rate.
q5_check = q5_ans.sort_values(by = ["denied_rate", "count"], ascending = False).groupby('denied_rate').first().reset_index().sort_values("denied_rate", ascending = False)
# Running maximum of #claims going down the rate ranking (a cumulative max).
q5_check["last_largest_count"] = q5_check["count"].cummax()
check_list = q5_check.loc[q5_check['count'] >= q5_check.last_largest_count]
check_list
| | denied_rate | Airline Name | count | denied_count | last_largest_count |
|---|---|---|---|---|---|
| 114 | 1.000000 | XtraAirways | 7 | 7 | 7.0 |
| 112 | 0.791667 | AllNipponAirways | 24 | 19 | 24.0 |
| 108 | 0.750000 | CaymanAirlines | 24 | 18 | 24.0 |
| 107 | 0.724138 | AirChina | 29 | 21 | 29.0 |
| 106 | 0.717391 | AirIndia | 46 | 33 | 46.0 |
| 103 | 0.703125 | ChinaAirlines | 64 | 45 | 64.0 |
| 96 | 0.663793 | LanAirlines | 116 | 77 | 116.0 |
| 94 | 0.655172 | KLMRoyalDutchAirlines | 203 | 133 | 203.0 |
| 84 | 0.621978 | AirFrance | 455 | 283 | 455.0 |
| 79 | 0.600726 | VirginAtlantic | 551 | 331 | 551.0 |
| 74 | 0.587591 | AllegiantAir | 822 | 483 | 822.0 |
| 69 | 0.581093 | JetBlue | 5765 | 3350 | 5765.0 |
| 50 | 0.542617 | ContinentalAirlines | 10348 | 5615 | 10348.0 |
| 44 | 0.530605 | DeltaAirLines | 20797 | 11035 | 20797.0 |
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter( x = q5_ans['count'], y = q5_ans['denied_rate'], text=q5_ans['Airline Name'],
mode='markers',
name='All airlines'))
fig.add_trace(go.Scatter(x = check_list['count'], y = check_list['denied_rate'], text=check_list['Airline Name'],
mode='lines+markers',
name='Candidates for Two-Sample Proportions test'))
fig.update_traces(textposition="bottom right")
fig.update_xaxes(range = [0,6000])
fig.update_layout(title={
'text': "Scatter plot rejection rate v.s. total claims", 'x': 0.5}, xaxis_title="Total claims", yaxis_title='Rejection rate')
fig.show()
The rejection rate for "JetBlue" is 58.1% (at significance level 0.05).
# We compare the candidates pairwise in descending order of rejection rate.
# JetBlue is the candidate with the largest #claims whose rejection rate is not significantly different from the previous candidate's.
from statsmodels.stats.proportion import proportions_ztest
tmp = check_list.values
for i, arr in enumerate(tmp):
    if i == len(tmp) - 1:
        break
    count = np.array([tmp[i][3], tmp[i + 1][3]])
    nobs = np.array([tmp[i][2], tmp[i + 1][2]])
    stat, pval = proportions_ztest(count, nobs)
    print('pairs for the test are:', tmp[i][1], 'and', tmp[i + 1][1])
    print('p value:', pval)
    print('_______________________')
pairs for the test are: XtraAirways and AllNipponAirways
p value: 0.18729322290848294
_______________________
pairs for the test are: AllNipponAirways and CaymanAirlines
p value: 0.7312838061826907
_______________________
pairs for the test are: CaymanAirlines and AirChina
p value: 0.8316594995770759
_______________________
pairs for the test are: AirChina and AirIndia
p value: 0.9494712957383331
_______________________
pairs for the test are: AirIndia and ChinaAirlines
p value: 0.8709118117487853
_______________________
pairs for the test are: ChinaAirlines and LanAirlines
p value: 0.5888432410741276
_______________________
pairs for the test are: LanAirlines and KLMRoyalDutchAirlines
p value: 0.8759007636054557
_______________________
pairs for the test are: KLMRoyalDutchAirlines and AirFrance
p value: 0.41472914863721544
_______________________
pairs for the test are: AirFrance and VirginAtlantic
p value: 0.491486647930489
_______________________
pairs for the test are: VirginAtlantic and AllegiantAir
p value: 0.627273992821891
_______________________
pairs for the test are: AllegiantAir and JetBlue
p value: 0.7238065469458539
_______________________
pairs for the test are: JetBlue and ContinentalAirlines
p value: 2.4488197647418275e-06
_______________________
pairs for the test are: ContinentalAirlines and DeltaAirLines
p value: 0.045313662900351874
_______________________
q5_dis = df.loc[df["Airline Name"] == "JetBlue"].copy()
q5_dis['Incident year'] = q5_dis["Incident Date"].dt.year
tmp = q5_dis.groupby("Incident year")["Status"].apply(lambda x: (x != 'Denied').sum()).reset_index(name='Approved_count')
tmp = pd.merge(tmp, q5_dis.groupby("Incident year")["Status"].apply(lambda x: (x == 'Denied').sum()).reset_index(name='Denied_count'), on="Incident year")
ax = tmp.plot.bar(x='Incident year', y=["Approved_count", "Denied_count"], stacked=True, figsize=(6, 2), title="JetBlue by year (claims between 2002~2013)")
ax.legend(prop={'size': 6})
I have put my EDA process in "0.3 Simple EDA before the start!". In this paragraph, I will:
# Prepare trainset.
df = df.dropna(subset = ["Item", "Status"])[list(df1.columns)+['DST', 'Tz Database Category', 'Airport Type','State', 'City']]
target_col = "Status"
category_cols = ["Airport Code", "Airline Name", "Claim Type", "Claim Site", 'DST', 'Tz Database Category', 'Airport Type','State', 'City']
numerical_cols = ["Claim Amount"]
# Some "Items" related features.
df["Items"] = df.Item.str.split('; ')
df["Items_count"] = df.Items.apply(lambda x:len(x) if isinstance(x, list) else 0)
numerical_cols.append("Items_count")
# Add one hot encoder from items.
items = list(set(itertools.chain.from_iterable([ i for i in list(df.Item.str.split('; ')) if isinstance(i, list)])))
item_mapping = {}
for i, item in enumerate(items):
    col_name = f"item_{i}"
    item_mapping[col_name] = item
    df[col_name] = df.Items.apply(lambda x: 1 if isinstance(x, list) and item in x else 0)
# Feature selection: keep only the item categories that appear in more than 1000 claims.
numerical_cols += list(df[list(item_mapping.keys())].sum().reset_index(name='count').query("count > 1000")['index'])
# Now we have Items related features!
df[numerical_cols].head()
| | Claim Amount | Items_count | item_1 | item_6 | item_12 | item_13 | item_23 | item_32 | item_40 | item_51 | item_75 | item_76 | item_94 | item_103 | item_105 | item_117 | item_136 | item_137 | item_139 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 75.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 2270.09 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9 | 4457.29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 16.71 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 450.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
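The item parsing above (split on `'; '`, count the pieces, then multi-hot encode) can be illustrated on a tiny toy series. The example strings are hypothetical but follow the same `'; '`-joined format as the `Item` column:

```python
import pandas as pd

# Toy 'Item' strings in the same '; '-joined format as the dataset.
s = pd.Series(["Laptop; Camera", "Jewelry", None])
items = s.str.split('; ')

# Items_count: 0 for missing entries, otherwise the number of items.
counts = items.apply(lambda x: len(x) if isinstance(x, list) else 0)

# Multi-hot encoding: one 0/1 column per distinct item; a compact
# alternative to the explicit loop used in the notebook.
onehot = pd.get_dummies(items.explode()).groupby(level=0).max()
```

Rows with several items get a 1 in several columns, which is why this is a multi-hot rather than a one-hot encoding.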
# Encode category_features here.
from sklearn import preprocessing
category_features = []
for col in category_cols:
le = preprocessing.LabelEncoder()
df['encoded_'+col] = le.fit_transform(df[col])
category_features.append('encoded_'+col)
df[category_features].head()
| | encoded_Airport Code | encoded_Airline Name | encoded_Claim Type | encoded_Claim Site | encoded_DST | encoded_Tz Database Category | encoded_Airport Type | encoded_State | encoded_City |
|---|---|---|---|---|---|---|---|---|---|
| 7 | 64 | 157 | 2 | 1 | 0 | 8 | 1 | 23 | 21 |
| 8 | 307 | 187 | 5 | 1 | 1 | 9 | 1 | 4 | 275 |
| 9 | 24 | 187 | 6 | 1 | 0 | 8 | 1 | 12 | 17 |
| 12 | 64 | 40 | 2 | 1 | 0 | 8 | 1 | 23 | 21 |
| 13 | 64 | 157 | 2 | 3 | 0 | 8 | 1 | 23 | 21 |
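`LabelEncoder` simply maps each distinct category to an integer, with classes stored in sorted order. A minimal sketch with hypothetical airport codes:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
codes = le.fit_transform(["JFK", "LAX", "JFK", "SFO"])
# Classes are stored sorted, so "JFK" -> 0, "LAX" -> 1, "SFO" -> 2.
# Caveat: le.transform() raises on categories unseen during fit(),
# which matters when encoding is fit before a train/test split.
```

Since we fit the encoders on the full DataFrame before splitting, unseen-category errors cannot occur here, though the integer codes carry an arbitrary ordering that tree models tolerate better than linear ones.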
# Add date related features.
df["Month_Received"] = df["Date Received"].dt.month
df["DayMonth_Received"] = df["Date Received"].dt.day
df["DayYear_Received"] = df["Date Received"].dt.dayofyear
df["Incident_Month"] = df["Incident Date"].dt.month
df["Incident_DayMonth"] = df["Incident Date"].dt.day
df["Incident_DayYear"] = df["Incident Date"].dt.dayofyear
df["Date_diff"] = (df["Date Received"] - df["Incident Date"]).dt.days
numerical_cols += ["Date_diff",
"Month_Received","DayYear_Received","DayMonth_Received",
"Incident_Month","Incident_DayYear","Incident_DayMonth"]
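The `.dt` accessors used above behave as follows on a toy pair of date columns (the dates themselves are made up):

```python
import pandas as pd

received = pd.to_datetime(pd.Series(["2005-03-15", "2005-07-04"]))
incident = pd.to_datetime(pd.Series(["2005-03-01", "2005-06-20"]))

month = received.dt.month            # calendar month, 1-12
day_of_year = received.dt.dayofyear  # 1-365 (366 in leap years)
date_diff = (received - incident).dt.days  # days from incident to filing
```

`Date_diff` is the most interesting of these: it measures how long a passenger waited before filing, which plausibly correlates with the claim outcome.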
# Process target column.
df = df.loc[df.Status.isin(['Approved', 'Settled', 'Denied'])]
df["target"] = df.Status.apply(lambda x: 1 if x in ('Approved', 'Settled') else 0 )
# Split train, valid, and test sets.
train_end = pd.to_datetime("2008-06-01")
valid_end = pd.to_datetime("2009-01-01")
train = df.loc[df["Date Received"] <= train_end]
valid = df.loc[(df["Date Received"] > train_end) & (df["Date Received"] <= valid_end)]
test = df.loc[df["Date Received"] > valid_end]
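Because claims are time-ordered, the split is chronological rather than random, so the model is always evaluated on claims filed after the ones it trained on. A toy check of the same boundary logic (dates are hypothetical):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2007-05-01", "2008-08-15", "2009-06-30"]))
frame = pd.DataFrame({"Date Received": dates})

train_end = pd.to_datetime("2008-06-01")
valid_end = pd.to_datetime("2009-01-01")

# Same boundary conditions as the notebook's split: <= is inclusive on the
# train/valid side, > is exclusive, so every row lands in exactly one set.
train_part = frame[frame["Date Received"] <= train_end]
valid_part = frame[(frame["Date Received"] > train_end) & (frame["Date Received"] <= valid_end)]
test_part = frame[frame["Date Received"] > valid_end]
```

The inclusive/exclusive pairing guarantees the three sets partition the data with no overlap and no dropped rows.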
# Build a baseline by Random Forest.
from sklearn.ensemble import RandomForestClassifier
features = numerical_cols + category_features
target = "target"
train_df = train[[target]+features].dropna()
X = np.array(train_df[features])
y = np.array(train_df[target])
model = RandomForestClassifier(n_estimators=250, min_samples_split=10, n_jobs=-1)
model.fit(X, y)
RandomForestClassifier(min_samples_split=10, n_estimators=250, n_jobs=-1)
# AUC is around 70%, not bad.
from sklearn import metrics
test_df = test[[target]+features].dropna()
test_df["baseline"] = list(pd.DataFrame(model.predict_proba(test_df[features]))[1])
fpr, tpr, thresholds = metrics.roc_curve(test_df.target, test_df["baseline"], pos_label=1)
metrics.auc(fpr, tpr)
0.7011761250958184
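The AUC computation itself is straightforward; here is a self-contained toy example using the same `roc_curve`/`auc` pattern (the labels and scores are made up):

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities

# One positive (0.35) scores below one negative (0.4), so ranking is
# imperfect: 3 of the 4 positive/negative pairs are ordered correctly.
fpr, tpr, thresholds = metrics.roc_curve(y_true, scores, pos_label=1)
auc = metrics.auc(fpr, tpr)
```

AUC can be read as the probability that a randomly chosen positive outranks a randomly chosen negative, so the baseline's 0.70 means it orders about 70% of approved/denied pairs correctly.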
# A simple LGBM model, without much tuning.
import lightgbm as lgb
clf = lgb.LGBMClassifier(n_estimators=3000)
clf.fit(X, y, feature_name=features, categorical_feature = category_features, eval_metric='auc',eval_set = (valid[features], valid[target]), early_stopping_rounds=2000, verbose=500)
Training until validation scores don't improve for 2000 rounds
[500]   valid_0's auc: 0.682544   valid_0's binary_logloss: 0.636809
[1000]  valid_0's auc: 0.681826   valid_0's binary_logloss: 0.649484
[1500]  valid_0's auc: 0.680161   valid_0's binary_logloss: 0.660744
[2000]  valid_0's auc: 0.679324   valid_0's binary_logloss: 0.670789
Early stopping, best iteration is:
[54]    valid_0's auc: 0.680788   valid_0's binary_logloss: 0.619585
LGBMClassifier(n_estimators=3000)
# Not much improvement over the Random Forest baseline...
# Better to tune the model if time permits.
test_df["lgb"] = list(pd.DataFrame(clf.predict_proba(test_df[features]))[1])
fpr, tpr, thresholds = metrics.roc_curve(test_df.target, test_df["lgb"], pos_label=1)
metrics.auc(fpr, tpr)
0.7085666717626535
# Feature importance.
# It is interesting to figure out that categorical features are essential. Especially the Airport.
import matplotlib
matplotlib.rcParams.update(matplotlib.rcParamsDefault)
lgb.plot_importance(clf)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Feature importance', ylabel='Features'>
# SHAP value is helpful for understanding numerical features.
import shap
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 100
shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(train_df[features])
shap.summary_plot(shap_values[1], train_df[features].rename(columns=item_mapping), plot_type='dot', plot_size = (8,7))
Both the Random Forest and LightGBM models reach an AUC of about 0.70 on the test set for this classification task (0.701 and 0.709, respectively). The current performance is acceptable, and could likely be improved with hyperparameter tuning.
From the feature importances and SHAP values, we observed: